The Ex Project: Web Information Extraction Using Extraction Ontologies
Authors
Martin Labský, Vojtěch Svátek, Marek Nekvasil, Dušan Rak
Dept. of Information and Knowledge Engineering, University of Economics, Prague, W. Churchill Sq. 4, 130 67 Praha 3, Czech Republic
Abstract
Extraction ontologies represent a novel paradigm in web information extraction (one of the ‘deductive’ species of web mining), allowing one to proceed swiftly from initial domain modelling to a running functional prototype without having to collect and label large amounts of training examples. The bottlenecks of this approach, however, are the tedium of developing an extraction ontology that adequately covers the semantic scope of the web data to be processed, and the difficulty of combining the ontology-based approach with inductive or wrapper-based approaches. We report on an ongoing project that aims to develop a web information extraction tool based on richly structured extraction ontologies, with the additional possibility of (1) semi-automatically constructing these from third-party domain ontologies, (2) absorbing the results of inductive learning for subtasks where pre-labelled data abound, and (3) actively exploiting formatting regularities in the wrapper style.
Similar resources
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
Ontology-Based Information Extraction
Information Extraction (IE) aims to retrieve certain types of information from natural language text by processing them automatically. Ontology-Based Information Extraction (OBIE) has recently emerged as a subfield of Information Extraction. Here, ontologies which provide formal and explicit specifications of conceptualizations play a crucial role in the information extraction process. Because ...
Combining Multiple Sources of Evidence in Web Information Extraction
Extraction of meaningful content from collections of web pages with unknown structure is a challenging task, which can only be successfully accomplished by exploiting multiple heterogeneous resources. In the Ex information extraction tool, so-called extraction ontologies are used by human designers to specify the domain semantics, to manually provide extraction evidence, as well as to define ex...
Use of Ontologies for Cross-lingual Information Management in the Web
We present the ontology-based approach for cross-lingual information management of web content that has been developed by the EC-funded project CROSSMARC. CROSSMARC can be perceived as a meta-search engine, which identifies domain-specific information from the Web. To achieve this, it employs agents for web crawling, spidering, information extraction from web pages, data storage, and data present...
Types and Roles of Ontologies in Web Information Extraction
We discuss the diverse types and roles of ontologies in web information extraction and illustrate them on a small study from the product offer domain. Attention is mainly paid to the impact of domain ontologies, presentation ontologies and terminological taxonomies.